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Abstract 



We propose a method for the problem of real time chord accompaniment of improvised 
music. Our implementation can learn an underlying structure of the musical performance 
and predict next chord. The system uses Hidden Markov Model to find the most probable 
chord sequence for the played melody and then a Variable Order Markov Model is used to 
a) learn the structure (if any) and b) predict next chord. We implemented our system in 
Java and MAX/Msp and compared and evaluated using objective (prediction accuracy) and 
subjective (questionnaire) evaluation methods. Our results shows that our system outper- 
forms BayesianBand in prediction accuracy and some times, it sounds significantly better. 

keywords: Machine Learning, Interactive Music System, HCI 
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Chapter 1 



Introduction 



Good afternoon, gentlemen. I am a HAL 9000 computer. I became 
operational at the H.A.L. plant in Urbana, Illinois on the 12th of 
January 1992. My instructor was Mr. Langley, and he taught me to 
sing a song. If you 'd like to hear it I can sing it for you. 

2001: A Space Odyssey 



1.1 Motivation 

My studies on machine learning were driven by the question of whether machines are capa- 
ble of learning like humans, acting intelligently or interacting with humans for the accom- 
plishment of a task. Furthermore, as a musician I was challenged by the idea of whether 
machines are capable for creativity, either as autonomous agents or by interacting with 
human performers. 

The idea of developing a system that join a jam session and support musicians by providing 
accompaniments or improvising music came from my need to explore and understand the 
way that humans improvise music. Trying to mimic the way a young musician learn to play 
or improvise music, we researched and developed a system which is capable of providing 
chords to an improvising musician in a jam session in real-time. 

In this thesis, we utilised supervised machine-learning methods which have been successfully 
used in fields like computational biology, text mining, text compression, music information 
retrieval and others. The process we approached this problem can be summarised in the 
following sentence. Interacting with improvising musicians in real-time using experience 
learned from data (off-line) and the rehearsal (on-line). 

1.2 Goals and contributions 

The main goal of this thesis is the development of a system that will be able to infer an 
underlying structure of the improvisation and predict and play chord accompaniments. A 
challenge of such system is that it must work under the real-time constraint ; that is, 
it must predict next chord and play it without the musicians (or the audience) noticing 
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artificial latencies. One other issue that makes such system challenging is that the input of 
the system is melody. This introduce further complexity to the problem since there is no 
strict mapping from melody to chords. What is more, the melody on itself doesn't contain 
sufficient information to provide "correct" chords. Thus, we need a subsystem that will 
be able to extract from a melody an underlying structure. That is, a chord progression 
that best explains/match the melody. Such system, however, might introduce errors that 
get propagated to the predictor, thus careful design of both subsystems (inferencer and 
predictor) is needed. 

Analytically, the main objectives of this project are: 



• To develop a Hidden Markov Model using Viterbi algorithm that given a melody, will 
be able to infer the corresponding chords (from time to t). 

• To use the information from the Hidden Markov Model and a Variable Order Markov 
Model to predict next chord (time t + 1). 

• To develop current state-of-art system which is based on Bayesian Networks [12] so 
as to compare. 



• To evaluate the system using both objective and subjective evaluation. 



G 



C 



G 



Figure 1.1: Chord prediction given a melody task. The chords in red are the chords as found 
from Hidden Makrov Model. The question mark indicates that we have to predict that chord 



Our contribution with this thesis is the development of such system which is capable of both 
off-line and on-line learning, like a musician which train himself with practice songs (off-line 
learning) and also understand the structure and the tensions in a jam-session and adapt 
the performance (on-line learning). As a product of this thesis, a plugin and a standalone 
application was developed as a Java external in Max/MSP and Ableton live . What is more, 
as byproduct of the thesis we developed a framework for creating online questionnaires, 
parsers for MIDI and MusicXML files and several python scripts for statistical processing 
of musical data. A complete list and the repository of the files is given in the appendix of 
the thesis. 



1 Max/MSP is a visual programming language for multimedia and music programming. Ableton live is 
software for real-time music performance and composition. 
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1.3 Outline 

In the following chapter we will introduce the reader to the field of interactive music sys- 
tems and the methods which we used during this thesis. What is more, with respect to 
the non-musician reader, we provide a musical background which will be sufficient for the 
understanding of the thesis. 

In chapter 3 we describe our system and present the design choices we made so as to 
accomplish our project. What is more we present the settings and the topology of the used 
models and give a description of the data set used for training and testing of our system. 

In chapter 4 we present the results of the evaluation of our system. 

Finally in chapter 5, we give an interpretation of the results, discuss the project's contribu- 
tion and present a plan for future work. 



Chapter 2 

Background 



The old distinctions among emotion, reason, and aesthetics are like 
the earth, air, and fire of an ancient alchemy. We will need much 
better concepts than these for a working psychic chemistry. 

Marvin Minsky 

In this chapter we aim to provide a brief brief overview on related work and state-of-the-art 
in interactive music systems and we give an introduction to the reader on music theory and 
computer music. 

2.1 Interactive Music Systems 

Robert Rower coined the term Interactive Music Systems to describe the upcoming subfield 
of Human-Computer Interaction where machines and humans interact with each other in 
musical discourse. Music is the product of such interaction where the computer takes the 
role of either the instrument (e.g. wekinator) or the fellow musician (solo improvisation, 
chord supporter, etc). 

Unfortunately, taxonomy of such systems is still under development since there are several 
disambiguations with terms such as "interaction" or "musical systems". As Drummond 
[8] coins, there are cases that the term "interactive system" describes reactive systems 
that contain the participation of audience, where the main difference between reactive and 
interactive is the predictability of the result. In this section we aim to give a brief description 
of related work in interactive musical systems using a simplification of Rowe's taxonomy. 

According to Rowe [24, p7-8] Interactive Music Systems can be categorised using the fol- 
lowing three dimensions: 

1. score-driven or performance-driven: The performance which the system interact 
with is precomposed or impromptu (without preparation) 

2. Transformative, generative or sequenced: The system either transforms the 
input or generates novel music or playback of stored material 

3. Instrument role or player role: The system is an extension of the human perfor- 
mance or an autonomous entity in the performance. 
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Covering the cartesian product of those dimensions is out of the scope of this thesis. How- 
ever, we will use the score-driven vs performance-driven dimension into our taxonomy (figure 
2.1). 



our system 
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Score Driven 



Non-real-time systems 
Real-time systems 



Interactive Music Systems 



Performance Driven 



Accompaniment Systems 



Interactive Improvisation System 



Figure 2.1: Taxonomy 



2.1.1 Score Driven 
Music Plus One [21] 

Christopher Raphael [21] developed a real-time musical accompaniment system for oboe. 
The system interacts with a musician who plays a non-improvised music and provides 
accompaniments using precomposed information. For this, the system use Hidden Markov 
Model so as to follow a score at real-time. 

In detail, Music Plus One consists of three subsystems called "Listen", "Predict" and 
"Play". The first subsystem take as input the audio of the musician and analyse it so 
as to find onsets. However, "Listen" subsystem might introduce latencies that are crucial 
for the interactivity of the system. Thus, "Predict" subsystem tries to predict next events 
and thus improve system's response. Prediction gets improved as long as new informa- 
tion becomes available from "Listen" subsystem. "Predict" subsystem make use Gaussian 
mixtures. Finally, "Play" subsystem output audio by using phase- vocoding technique [9]. 
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2.1.2 Performance Driven 
Interactive Improvisation System 
Imrovised Music with Swarms [6] 

Blackwell [6] criticised previous systems for their lack of interaction with the musicians. 
What is more, he believes that they were mainly focused in modelling and encoding musical 
knowledge, providing to the user accompaniments or suggestions regarding harmony or 
compose algorithmically using the knowledge and rules (aesthetics) hardcoded from the 
programmer. His contribution was to develop a system using ideas inspired by the way bird 
organised in flocks. 

Mimicking swarms' for creating intelligence is not a new idea (similar idea to Minsky's 
Society of Mind), however it is the first time that it is applied for interaction and, moreover, 
in musical interaction. Music like flocks is a self-organizing system. Harmony, melody, 
chords are moving towards directions with respect an organisation. Musicians force the 
flock to different directions without breaking this organisation. Blackwell, in this work, 
created a system using original ideas initially developed by Craig Reynolds in his program 
"Boids" [22]. 

Genjam [4] 

John Biles in GenJam utilise genetic algorithms for jazz improvisation. Although it's orig- 
inal version in 1994 was not interactive in 1998 the system re-designed so as to improvise 
in a real-time context. 

Genetic Algorithms originally developed by Holland in 1975 [11] are biological inspired al- 
gorithms which mimic the evolution process in nature. They consist of a set of encoded 
strings called population. In GA terminology the strings are called genotypes. The al- 
gorithm evolve the population by applying mutations and combinations of the genotypes 
iteratively. Also, each time it evaluates the population by using an objective function, called 
fitness function. 

The system performs solo trading, which means that it respond to a melodic phrase by 
mutating it. The approach followed in GenJam is called "fours trading". That is, it listens 
the last four measures from the human participant and it saves it in a chromosome. Then, 
it calls a Genetic Algorithm which after several evolutions it produces a new population / 
music solo. 

A real-time genetic algorithm in human-robot musical improvisation [27] 

Weinberg et al, developed a system in 2009 which is able to improvise melodies in an 
interaction with a human musician. Their system actually is a robotic arm which plays a 
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xylophone. This system also make use of genetic algorithms so as to to improvise. 

The system is fed with melodies which then fragments and uses as population. Similarly to 
GenJam, each time the system is called to improvise it mutates and combines several genes 
from the population and evaluates them in means of how much they " fit" the melody which 
the system is called to answer. The method they used to calculate fitness is Dynamic Time 
Warping , which is very similar to Levenshtein distance 2 . 

Continuator [16] 

Continuator was one of the greatest achievements in the filed of Interactive Music Systems 
and inspired this thesis in great means. Pachet's system performs in an improvisation and 
is capable of providing continuations of a soloing melody in the same style. This is of 
great importance since it's the first time that style agnostic system achieves to retrieve and 
simulate the participants style. What is more, his work showed the ability of variable order 
Markov models to work in real-time context. We will describe the model and the system in 
detail in next chapter where we introduce variable order Markov models. 



Accompaniment systems 



BayesianBand [12] 

BayesianBand is a system which was developed by Kitahara et al in 2009. Their system 
provides accompaniments in improvised music using Bayesian networks in real time. As far 
as we know, this is the closest system to ours and thus we used it to compare with. 

In their approach they use Bayesian networks so as to infer next note and then predict next 
chord. What is of great importance is that they update the probabilities during performance 
so as to improve next note inference. That way they improve also prediction of next chord 
since, as we will se later, next chord depends on previous notes and chords predicted. We 
won't describe it in detail until we describe Bayesian networks, in chapter 3. 



1 Dynamic Time Warping is a technique for measuring similarity in patterns that vary in time. 
2 Levenshtein distance is metric used in calculating strings similarity. 
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2.2 Musical background 



2.2.1 Elements of music theory 
Notes 

A musical composition is very similar to a construction. Notes are the building blocks of 
(almost) every musical composition. They are the "Atoms" of Western music. 

Musical notes consist of 5 properties. Pitch, note value, loudness, spatialization and timbre. 
We will mainly focus in pitch and value. Briefly, loudness is how loud a note is perceived, 
spatialization is the position in the space (e.g. left/right) and timbre is the quality of the 
sound, the texture. 



Pitch class 


Frequency (Hz) 


A 


440.00 


A# 


466.16 


B 


493.88 


C 


523.25 


C# 


554.37 


D 


587.33 


D# 


622.25 


E 


659.26 


F 


698.46 


F# 


739.99 


G 


783.99 


G# 


830.61 


Figure 2.2: 


Note to frequency 



Pitch is how human perceives sound frequencies. For example a sound of 440 Hz in Western 
music is perceived and assigned to the pitch A. Usually we will refer to the pitch as pitch 
class since each pitch is a class of frequencies and not a specific frequency. More specific, 
human ear perceives as in the same same frequencies whose ratio 2 n ,n G Z. For example a 
sound with / = A4QHz is perceived in the same pitch-class as a sound with / = 880Hz. In 
figure 2.2 3 you can see a mapping of a frequency to a pitch class. More formally, a pitch 
class of a frequency A is: 

PitchClass(A) = {B : A/B = 2 n , n G Z} (2.1) 

However, we will refer to pitch class by a symbolic name, e.g. A is the pitch class which 
contains the 2 n ,n G Z multiplications of 440 Hz frequency. 

We won't talk about frequencies anymore. From now on we will only concern about pitch 



3 http://www.cs.nyu.edu/courses/fall03/V22.0201-003/notes.htm 
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classes notated after their english name ( {[A — G]} U {b, ft, _} ). 

The difference between two notes is called interval. The smallest interval is a semitone and 
two semitones make a tone. In Western music, a semitone is ^ of an octave, were octave 
is the minimum distance between two different pitches which are in the same pitch class. 
For example a note with a pitch of frequency 880 Hz is an octave higher than a note of 
frequency 440 Hz. 

Another element of Western music are the accidentals. The flat accidental b decrease the 
note which is applied to by one semitone and (J increase the note by one semitone. For 
example A$ is pitch class A raised by one semitone. As you can see, in figure 2.2 we 
included accidental notation to the pitch class. 

Also an assumption we make which we will discuss later is that two letters without acci- 
dentals have a distance of a tone, except E-F and B-C which have a distance of a semitone. 
That is because as we will see, there is a specific arrangement of the notes in frequency 
space. 

Another characteristic of a note we will discuss is note value, which is the duration in time 
or it's relative time distance from the next note. A whole note is a note which length equals 
to four beats. As we go deeper in the tree in figure 2.3, the duration is subdivided. For 
example a Half Note lasts two beats, a Quarter one beat, etc. 



o 



Whole Note 



O 



O Half Note 



Quarter Note 




Eighth Note 




Sixteenth Note 



Figure 2.3: Note hierarchy 



Finally another very important element of music is silence which only characteristic is 
duration. The musical name of silence is rest and it's value has the same meaning with notes' 
duration. Note value is what gives rhythmic meaning in a musical composition. Although 
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there are a lot more to be said about notes we will stop here since further explanations are 
out of the scope of this thesis. 

Rythm 

Notes are arranged in frequency space but also in time. This arrangement in time is what 
humans perceive as rhythm. A musical composition is characterised by a tempo. It's the 
heartbeat of music and is measured in Beats Per Minute (BPM). Notes are grouped in 
meters (also called bars). Each meter contain notes which duration sum is the same in each 
meter. In other words, meters split the composition in equally sized groups. The size of a 
group is in means of notes duration. 

Time signature specifies how many beats are in each meter. For example a common time 
signature is \ which means that we have four quarters of note, where unless differently 
defined, a quarter is one beat. 

Melody and Chords 

Melody is a sequence of notes of several pitches and time values. In Western music melody 
usually consists of several phrases or motifs or patterns. 

A chord is group of two or more notes that are played simultaneously. A typical form of 
chords consists of three notes (triads). A triad usually consists of the root, the third and 
the fifth, where the root is the first note of the chord. The third is the note which distance 
is three (minor) or four (major) semitones from the root and a fifth is a note which distance 
is 7 semitones. The quality of the third separates minor chords from major chords. It's 
easy to see that for 12 notes we have 12 + 12 = 24 triads (minors and majors). 

In a musical composition, chords are usually played in progressions revealing structures and 
patterns. Chord progressions and melody are highly dependent and each restricts the other 
under the rules of harmony. 

Tonal music, scales and Harmony 

Tonal music is the music which is organised around chord progressions. However, this 
doesn't make it special by it self. We need also a constraint of what chords are allowed and 
what not. This constraint is the scale. 

A scale, in simple words, is a set of notes that we are allowed to play. Under this constraint 
melody and chords are developed. In an improvisation act, musicians share and respect this 
constraint. It is the common ground. Scale usually characterise a composition, however, 
there are cases where the scale change (key modulation) and thus new constraints are 
introduced. 



2.2 Musical background 
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2.2.2 Computer music 

Finally, we will close this introduction to the reader with a brief description of some elements 
of computer music. 

2.2.3 MIDI protocol 

So far we saw how music is decomposed. However, we need a way to model music so as to 
be understandable by computers. Please note that by music we mean symbolic music, the 
blueprints of a musical composition and not the audio. 

Musical Instrument Digital Interface (MIDI) is a protocol which describes music in sequences 
of events. Those events can be streamed from an instrument (e.g. synthesiser) or saved 
in files. The most important events are those of Note On and Note Off. Each of those 
events have the specific format. 



MIDI event: t , 0x80 or 0x90 , [0 - 127] , [0 - 127] 

Time elapsed MIDI event type Note Number Note Velocity 

As we can see, midi events are sufficient to describe pitch classes, durations and time 
distances between notes. Chords can be seen as sequences of notes with t = 0. 

1200,note u on,41,90 
1392, note u off ,41,0 
1440,note u on,41,90 
1512, note u off ,41,0 
1530,note u on,41,90 

Figure 2.4: example of midi events 



Programming languages and tools 

So as to simplify the process in our thesis and focus in machine learning algorithms and 
methods we aim to make use of a audio/music programming language. After researching we 
found that the most appropriate language for this task was Max/MSP which is a visual pro- 
gramming language, originally developed by Miller Puckette in IRCAM and commercially 
developed by Cycling'74 4 . 

Our methods will be implemented mainly in Java language since we can easily develop 
plugins for the mentioned programming language. 



4 http://cycling74.com/ 
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Finally, for music synthesis we aim to use Ableton Live 5 which is a suite heavily used in 
computer music creation. 



5 http:/ /www. ableton. com 
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Graphical Models 

A graphical model, is a statistical model for describing dependencies between random vari- 
ables which are expressed in terms of conditional probabilities. Conditional probability 
represents the probability of an event given the knowledge of another event. For example 

P(" chord is Em" ("last note played is E") (3.1) 

denotes the probability that the chord to be played is E minor given that the last note 
played was E. 

As the name reveals, a graphical model is a graph G = (V,E), whose nodes V are the 
random variables and the nodes E describe the dependencies among nodes V. [19, ch. 3] 
and [5, p. 359]. 

The advantage of graphical models is that they simplify and reduce the complexity of the 
computation of joint probabilities in means that we can easily compute them by decom- 
posing them into a product of factors. We will now see a specific class of graphical models 
named Bayesian networks. 

3.1 Bayesian networks 




Figure 3.1: Example of graphical model 

Bayesian networks are graphical models with the extra property that the graph is directed 
and acyclic. Each node represents a random variable in means of Bayesian statistics. The 
edge of the graph represents dependency of the random variables and the direction show 
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which variable depends on which. For example, in figure 3.1, the model is interpreted as 
following. The probability of a chord playing is conditioned from the probability of a note 
currently playing and the note played before. 

Bayesian networks are very important due to the simplification that they introduce in 
computing join probabilities. One should sum over all possible different values of random 
variables . For example for a graphical model with binary random variables A, B, C, D and 
dependency chain A — > B — > C — > D, the joint probability P(A, B, C, D) is computed as 

P(A, B, C, D) = P(A)P(B\A)P(C\A, B)P{D\A, B, C) 

Without using the topology of the dependency graph, this computation might be inefficient. 
Bayesian networks provide a framework for simplifying the computations exploiting the 
dependencies. 

Let pa{n) be the function which returns the parents of a node n. For example in figure 3.1, 
pa(chord) is {current note, previous note} (random variables of the graph). 

Then for x = {x%,X2, ■ ■ ■ , xk} the joint probability Pr(x) is computed as: 

K 

p r (x) = Yl Pr(xi\pa(xi)) (3.2) 

For examples, for the model in figure 3.1 we have: 

Pr(Chord, Current note, Previous note) = (3-3) 
Pr(Chord| Current note, Previous note) Pr( Current note)Pr (Previous note) (3.4) 

When the random variables represents sequence of data (e.g. time series or sequence of 
data like proteins) then the network is called Dynamic Bayesian Network. 



3.1.1 Applying bayesian networks for chord prediction 

In the previous section, we saw how a graphical model can capture the dependencies among 
random variables with graphical models and we introduced a specific class of graphical 
models, the Bayesian networks. As we will see, we can use Bayesian networks to provide 
a fast solution for the problem of providing supporting accompaniments to an improvising 
melody. The method we describe is based on the BayesianBand system as proposed by 
Kitahara in 2009 [12]. 



3.1 Bayesian networks 
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Figure 3.2: Bayesian network 



A musician during a jam-session, tries to predict the music which will played in the future, 
based on his experience during the session. For example, a guitar player supports a fellow 
musician by playing chords which will be consistent with the melody. To do so, the player 
will try to predict the note which will be played and then decide which chord is more 
appropriate. This process can be decomposed into two phases: 

• Melody prediction phase where the musician predict the note which will be played 
next. 

• Chord prediction / inference phase where the musician decide which chord fit on the 
melody playing and the predicted note. 



Inference using bayesian network 

Lets define a chord sequence c as 

c = (ci, . . . ,c„) 
and a melody n as a sequence of notes 

n = (m, . . . ,n m ) 



As we saw in the music theory section, there is a sequential dependence in melody and 
chord progressions. Thus 

Pr(n t+1 \n) (3.5) 

and 



Pr(c t+1 \c) 



(3.6) 
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So as to simplify the process, Kitahara et al. [12] assume that notes and chords are 2nd 
order Markov chains. In that way, the complexity of the algorithm is kept low and thus a 
system which will decide and predict in real time can be developed. 

Under the Markov assumption probabilites 3.5 and 3.6 are simplified to 

Pr(n t+ i\n u n t -i) (3.7) 

and 

Pr(c t+ i\ct,c t -i) (3.8) 

The Bayesian model shown in figure 3.2 is driven by the following observations. First of all 
there is a sequential dependence in the melody and the chords progression. What is more, 
a chord depends on the note which it supports, hence, we add an edge from predicted note 
(nt+i) to next chord (q+i). 

Recalling the task, the system must: 

1) Predict the note that maximise the probability Pr(rit-\-i\nt, nt-i) 

o = arg D max Pr(n m = o\n t ,n t -i) (3.9) 

2) Then, find the value of c% latent random variable which maximise the probability 

c = arg c max Pr(c m | ra t+ i = o, c t = coi,c t -i = co 2 ) (3.10) 
where co\ and C02 are the values as observed. 

Please note each time there is a new note, the predicted chord is played and then predict 
next note. That way time-response of the system is reduced. What is more, each note gets 
12 note values, one for each pitch-class. Also, the chords they use are the 7 diatonic chords 
(chord allowed under the key constraint). 

Incremental update 

A main property in a jamming session is novelty and interaction [16]. That means that 
during the session, musicians should learn what fellow musicians play. In this case, the 
melody prediction should use knowledge from the session and adapt the prediction on the 
current melody playing. So as to achieve that, they update their model each time a new 
note of the melody is played. 

Thus, the probability Pr(n t +±\n t , n t -i) is computed as follows: 



3.2 Markov Model 
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Pr covpus (n t+ i\n u nt-i) + a log N{n t -u t t ) N %£* t ™) l) 
Priru+i \th, Th-i) = ; — tt~, ; ' (3.11) 

where N(nt-i,tt) and N(rit-i,tt,rit+i) are the frequencies that the notes nt-x,tt and 
n t -i,tt,n t +i appeared in that sequence during the session. Constant a is a parameter 
of the user and defines the novelty of the system. 

Probability Pr COTpvLS (n t +i\n t , n t -\) can be easily computed using the corpus. 



3.1.2 Assumptions, limitations and extensions 



One of the greatest advantages of this system is that it can easily adapt to the performance. 
That comes from the fact that the model is characterised by the a-mnemonic property of 
Markov chains. However, that assumption make also impossible to understand and follow 
a structure of the chord or note (melody) progressions. In our project we aim to improve 
this model so as to be able to catch higher order dependencies. So as to accomplish that 
we aim to use Hidden Markov Model with variable order Markov chains. 



3.2 Markov Model 



Markov models are graphical models where each node represents the state of a random 
variable at a specific time and each node depends only on the node corresponding to the 
immediate previous state. For example, we denote Xt the random variable which corre- 
sponds on a state at time t. The memoryless property (also called Markov property) states 
that the random variable corresponds at time t depends only on the random variable at 
time t — 1, or more formally: 

P(X t+1 \X 1 ,...\X t ) = P(X t+1 \X t ) 




v_y kj kj kj 

Figure 3.3: Example of markov model 
The possible values of Xt is called state set S. 

Transition probabilities P(Xt+i\Xt) are usually modelled using a transition matrix where 
Ojj = P(X t+1 = Si\X t = Sj), where Si,Sj € S. 
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One variation of Markov models is when each state depends on the previous n states. More 
formally that means that 

P{Xt+i\Xi, . . . \Xt) = P(Xt+i\Xt, Xt-i, . . . , Xt-n+l) 

This variation is called n-order Markov model. For the shake of simplicity 1st order Markov 
models are called simply Markov models. 

Until now we have seen some graphical models that will help us to describe the models that 
we used during this thesis. 



3.3 Hidden Markov Model 



Imagine that a Markov process (that is a stochastic process with the Markov property) 
generates data and we, as observers, observe not the actual process but the outcome of the 
process. This assumption is very important since we can simplify several tasks, as we will 
see later. Hidden Markov models are an extension of Markov models (sometimes known 
also as observable Markov models) with which we try to deal with such cases. 

Hidden Markov Models are a subclass of Dynamic Bayesian Networks and have been used 
successfully in speech recognition [20], in computational biology [14], in prediction of finan- 
cial time series [28] , etc 



starting probability 

P(X =l)= 7T(i) 



^ 

P(Y 2 =j\X 2 = i) 





emission probability 




transition probability 
P(X t+ i=i\Xt=3) 




hidden variables 



observed variables 



Figure 3.4: Example of Hidden Markov Model 



The HMM (Hidden Markov Model) consists of two types of random variables. The observed 
and the hidden variables. In figure 3.4 X{ represents the hidden random variables and Y% 
the observed random variables. As you can see, Yi is independent to Yj, i ^ j and for Xk 
we have the Markov property P(Xk+i\X)~)- X\ 6 S is a discrete random variable, where S 
is the set of possible states of hidden random variable. We denote as N = \S\ the number 
of possible states of each hidden random variable. 
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The transition probabilities P{Xk+i\Xk) are usually stored in a rectangular matrix where 
ay = P(Xk+i = Si\Xk = Sj) and Si,Sj £ S. The initial probability distribution P(Xq) 
is called prior probability and is stored in a vector n(i) = P(Xq = Si). In figure 3.5 we 
gathered this information in a table. 

The observed variable can be either discrete random variable or continuous random vari- 
able. The probability that a hidden random variable Xj~ emitted an observation Yk = Ok 
is called emission probability. Please note that Ok is the observation at time k. The obser- 
vation might be a class, an integer, a real value and as we will see later, even a histogram. 
Depending the observation the observer random variable is either discrete or continuous. 
For example, imagine the stock market as a Markov process where the observed variable 
is the price of a share. The observed random variable is continuous and a common way 
to model such variables is using Gaussian distribution iV(/i, a). The emission probability 
bj(Ok) = P(Yk = Ok\Xk = Sj) is a probabilistic function that models this probability. 



Xk 


hidden random variable 




Y k 


observed random variable 




S 


set of possible values of hidden random variable 




N 


number of possible values of hidden random variable 




dij 


Transition probability P{Xk+\ = Si\Xk = Sj) 




bj(O k ) 


The probability that a state Xk at state Sj emitted an observation Ok 


A 


Matrix that contain transition probabilities A(i, j) = 




B 


Function that represent the emission probability B(i,j) = 


bj(Ok). In 




the case of discrete observed random variable, this function 


can be rep- 




resented with a matrix. 




7T(») 


Initial probability P(Xq = Si) 




9 


Parameters of the Hidden Markov Model (HMM) 9 = (A,B 





Figure 3.5: Notation of HMM (Hidden Markov Model). 



Problems to solve with HMM 

Hidden Markov Models are very useful in solving the following three problems [20] : 

Probability of an observation: For a given sequence of observations O = Oo, ■ ■ ■ , 0% 
and a model 6 = (A, B, 7r)[TODO: Explain them] we want to compute the probability that 
P(0\9), that is the probability of the observation given the specific model. 

Model learning: Given data we want to optimise the model parameters 9 = (A,B,tt), 
such that P(0\9) is maximised. More formally : 

9* = argmaxP(0|6>) 



Most probable hidden state sequence: For a given sequence of observations O = 
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Oo, . . . ,Ot and a model = (A, B, ir) we want to find the most probable sequence of states 
of hidden random variables, that is the sequence of states that best explain the observations. 

In this thesis we will mainly focus on the last type of problem. The problem we will try 
to solve is to find the most probable chord sequence (latent/not observed variable) that 
explain the melody played (observation). 

Learning the model's parameters 
Most probable hidden sequence and dynamic programming 

The most probable hidden sequence of a Hidden Markov model is also know as the " optimal" 
state sequence, associated with the given observation sequence, given that the "optimality" 
criterion (or objective function) is the probability of that sequence (P(Q|0, #), where Q = 
Qi,Q2, ■ ■ • )Qt is the hidden state sequence, O = 0±, O2, ■ ■ ■ , is the observation sequence 
and 8 = (A, B, n) are the parameters of the model). 




12 t 

Time 



Figure 3.6: Trellis view of Viterbi algorithm 

A trivial solution to that problem might be to compute the probability of each possible 
sequence. It's easy to see that the number of possible paths increase exponentially to time. 
For example of a state set of size N and an K observations the number of possible paths is 
K . In figure 3.6 you can see a generic example of an HMM (Hidden Markov Model). The 
highlighted nodes are the "optimal" states and the path in red is the "optimal" path (also 
known as Viterbi path). 

A more feasible and trackable solution to this problem was given by Andrew Viterbi in late 
60s [10]. The algorithm we will present you make use of Bellman's optimality principle 1 
for the design of a dynamic programming algorithm. 

1 Bellman's principle of optimality - The globally optimum solution includes no suboptimal local decision. 
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Algorithm 

There are several variations of the algorithm, however in this thesis we aim to use a variation 
that is similar to an algorithm know as forward algorithm. 

The dynamic programming recurrent relations are: 

V , k = P{O \X = S k ) ■ n k (3.12) 

V t>k = P(O t \X t = S k ) • maxK fc • V t _ M ) (3.13) 

i 

Vt^k represents the probability of the most probable state sequence until time t for Sk as 
final state. However, what is of our interest is the path and not only the probability. For 
this we use the trick to save pointers to parent states in the max step of second equation. 
The matrix pa(k, t) contains the parent of state Sk at time t. 

The retrieval of the Viterbi path for an observation of length T is computed using the 
following recurring formulas: 



y T =argmax(y Tii ) (3.14) 
y t -i =pa(y t ,t) (3.15) 

Complexity 

In the computation of the probabilities we can see that initial step 3.12 takes time O(N) 
since we have to initialise Vq^ for all = N possible states. The second step 3.13 needs 
0(N 2 ) time since we have to update Vt k for all TV states and for each update we have to 
compute the maximum quantity maxj(o^ • Vt-ij). Also, by noting that this step will run 
for every previous time (= T) the overall complexity is 0(T ■ N 2 ). As you can see Viterbi 
algorithm improved the complexity from 0(K N ) to 0(T ■ N 2 ). 

To close this brief description of the background theory we will introduce you to the concept 
of Variable Order Markov models which will be the component that makes our approach 
and system different to the current state of art. As we will see, this model allows us to 
capture repetitive patterns of chords and enhance the predictive power of our model. 



3.4 Variable Order Markov Model 



In this section, we introduce an extension of the fixed order Markov chain, the Variable 
Order Markov model (VOM) in order to increase the prediction power of the system. The 
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input will be a sequence of events and we aim to predict the next probable event. 

As we saw in the previous chapter, it is possible to model dependencies between chords and 
notes by using Bayesian networks. We will now see another probabilistic model which has 
been used in interactive music system of Francois Pachet, the Continuator. [16] 

In his paper, Pachet, described a system which has the ability to provide continuations of 
the musician's input. Continuation of melody is a prediction of the future of the melody 
which a musician played. It's easy to see that a simple 2 order Markov model can't be 
used for this task. We need a model which captures higher dependencies. Based on this 
idea, Pachet utilized Variable Order Markov models (VOM) so as to generate continuations. 
Let's discuss this idea more formally. 

A learner is given a sequence q = {gi, q%, . . . , q n } where qi G X and qiqi+i is a concatenation 
of qi and qi+i- The problem is to learn a model P which will provide a probability of a 
future outcome given a history. 

More formally: 

P(cr\s)- the probability of a given a suffix s 

where s S X* is called "context" and <j£S. £ is a finite alphabet (in our case this might 
be the possible notes or chords). The model P can be seen as the conditional distribution 
probability. Observe that the "context" can be, for example, a melody previously played 
and a the note which will be played next. Let's assume that the "context" is a melody 
(note sequence). Then, the prediction of the next note given a sequence of notes requires 
the computation of the probability of a note - symbol a given a sequence s. Let's leave this 
for a while and discuss a bit more about variable order Markov models. 

In 1997, Ron et al. [23] showed that a variable order Markov model can be implemented 
with a subclass of probabilistic finite state automata, the Probabilistic Suffix Tree (PSA). 
Those trees provide a fast way to query a tree with a sequence s and retrieve the probability 
P(a\s). PSTs have been used extensively in protein classification [3] due to their speed of 
construction and lower needs of memory than HMMs. Please note that PSTs should not 
be confused with suffix trees, although they share common properties. The main difference 
is that the PST is a suffix tree of the reverse training tree [3]. 

The time and space complexities as reported in [23] are presented in figure 3.7 ( D is the 



Complexity 


Description 


0(Dn 2 ) 

0(Dn) 

0(D) 


time for construction (learning) 
space 

time for the query 



Figure 3.7: time-space complexity of PST 



bound of the maximal depth of the tree ). Although it seems a bit restrictive the fact that 
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the maximal order of the VOM is bounded, it is justified by the fact that it is impossible 
even for an experienced musician to retrieve from memory infinite long sequences. However, 
in 2000, Apostolico et al. [1], improved PST's complexities to 0(n) for learning and 0(m) 
for predicting a sequence of length m. 




Figure 3.8: example of PST tree for £ = {a, b, c} - Each node is labeled after the nodes we meet 
when we follow the path to the root. Let's assume we want to find the probability of P(c\"bc"). 
Then, after querying the tree with the reverse of the string "be" we end to the highlighted node. 
Thus, the probability is 0.8. If we wanted to sample the next symbol of the sequence "be", we 
should just use the vector (0.1, 0.1, 0.8) as probability distribution for the symbols a, b, c. 

3.4.1 The Continuator 

Inspired by Ron's PSTs, Pachet simplified the data structure so as to be able to provide 
continuations. His approach is different than Ron's since the tree keeps information for all 
the played sequences (lose less). What is more, instead of computing the probability for 
each symbol, a pointer to the position of the initial training sequence is saved. That way, we 
are able to have access to several other aspects of the sequence's element, such as velocity 
or duration. 

The construction of the tree has time and space complexity 0(n 2 ) since in worst case we will 
have to insert all the sub-sequences (0(n 2 )) of the string in our tree. The tree initially is 
empty. T[symbol] returns the child of the node T which is connected with label symbol. If 
there is no such transition then it returns NULL. Each node has a list of pointers pointer list 
to the position of the next symbol. For example for a sequence abb we will have a path 
root —7- b —7- a which pointer list contain a pointer to the second b symbol. 

In figure 3.9 you can see such a tree constructed from the sequence abed and abbe. The 
dashed arrows show the pointers to the position in the learning sequence. So as to predict 
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Algorithm 1 learn sequence(s) 



For i=l To length(s)+l 
T Tree root 
For j=i Down To 1 

If T[sj] / NULL Then 

T. pointer _list.append(i + 1) 

Else 

T.insert(sj,i + 1) 

End If 

T <- T[8j] 
End For 
End For 



the next symbol of a sequence we have to traverse the tree with the reverse of the query 
string and then sample uniformly at random a pointer from the pointer list. For example 
for the query ab we search ba in the tree and then sample a pointer from the pointer list (c 
from 1st sequence or b from 2nd sequence) with equal probability. 



training sequence 1 




a b . b c 

training sequence d _ _ . 



Figure 3.9: Result of learn_sequence("abcd") and learn_sequence(" abbe"). 
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and implementation 



Music gives a soul to the universe, wings to the mind, flight to the 
imagination and life to everything. 



Plato 



4.1 Anticipation and surprise in music improvisation 

A crucial aspect of a jam-session musician is the anticipation. That means that jam-session 
support musician should be able to predict musical events and provide a musical stability 
in the performance. For example, a musician responsible for chord accompaniments must 
be able to understand patterns in music of the improvisation and try to predict and protect 
the stability of this structure. That way the soloing/improvising musician will depend on 
the assumption that the structure is stable and thus will be able to improvise in harmony. 

Our observation was that during a jam rehearsal there is a convergence of a stability as the 
time pass. Initially there is no preparation and thus no structure established. However, all 
the musicians tend to introduce ideas and variations that might enter the theme or forget 
them. Depending his/her role in the rehearsal, the musician will either be the one that 
introduce the innovation or the one that will try to provide accompaniments. The goal of 
this thesis is to develop a system that will serve the duties of the second. 

The input of the system if a musician that improvise polyphonic or monophonic melodies. 
Without any prior knowledge about the music played the task is to predict next chords and if 
possible converge to an underlying structure. This task has several not-so-obvious problems 
that we must solve. One of the greatest problems is that we don't have any information 
about the chords that have been played. Thus, our system must have a subsystem that will 
be able to infer chord sequences according the history of played music. 

Another problem is that we have to predict next chords using information learned during the 
rehearsal (online learning). For example, in 12-bar blues the structure (chord progression) 
is strict and repeated every 12 bars. However, an improvisor who plays on 12-bar song 
will not play strictly the same melody but will try to play in harmony with the chords. 
How can we l)find the chord sequence that fits the history and 2)understand the structure 
and predict next chord? Those two questions are answered using two model that we have 
already discussed in previous chapter. 
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4.2 Our method 

The system designed and tested in this thesis consists of two subsystems. The inferencer 
and the predictor. Before we describe each subsystem separately we will show you the 
interaction of those two subsystems. 



The input melody 




Play the prediction 



What chord sequence best explain the melody? 



Chord inference 



Chord predictor 



The most probable chord sequence is C,G,C,G 
Given that chord sequence is CGCG what's next ? 



Learning sequence 



Prediction model 



1' 

Our guess is C 



Figure 4.1: Brief system overview 



As we can see in figure 4.1 the system initially takes as input the melody played by a 
soloing musician. As you can see there is no knowledge about the chord or the structure. 
The inferencer using knowledge learned from corpus try to find a chord sequence that explain 
the melody. That means that this chord sequence might be in the musician's mind or played 
from other peer musicians and we have no way to interact with them. The inferencer outputs 
the sequence and feeds that sequence to the predictor. The predictor then use that sequence 
to learn the patterns and add this knowledge to the chord predictor. The chord predictor 
then using the chord sequence will try to predict next chord. As you can see, our aim is to 
improve the prediction using knowledge learned from the inferred sequence. 



4.2 Our method 
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One key limitation to our design choices is that each subsystem must work fast enough so 
as to be able to work in real-time. That means that the models and algorithm that we 
choose should be run very fast in a descent computer. Also, so as to simplify the task we 
made the following assumptions: 

1. Tempo is fixed. That means that it doesn't depend on the performance. 

2. Chord changes occur at bar change. That is that each bar is assigned one and only 
chord. 

3. There doesn't occur any key- modulations. That means that the key remains constant 
during the performance. The recognition of key changes is a research topic on itself. 

4.2.1 Chord modelling 

Chords modelling and representation is of great interest on this thesis since this decision 
affected mostly the performance and the output of the system. We will describe 3 repre- 
sentations that we tried and what was the drawback of each representation. 

The first idea was to model each chord as a 12-vector followed by the pitch of the root. Each 
position of this vector represents the note relative to the root that is played. For example 
for a Cmajor chord we have that notes at position relative to the root 0, 5, 7 are played, thus 
Cmajor will be represented as (C, < 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, >). It's easy to see that this 
representatios is very "hard". That means that for similar chords (e.g. Cmajor and Cmaj9) 
we will have different representations. This "hardness" of this representation incresed the 
" zero- frequency" problem since most of the songs had several variations of chords. That 
happens because usually musicians have a basic chord structure on their mind and then 
they introduce several stylistic note into the chords. The second representation we will see 
tries to deal with this problem. 

In out approach we decided to include musical knowledge into chord representation and 
simplify chords in a way that similar chords has the same representation. But what is the 
similarity threshold? We decided to split the chords according the quality of the 3rd and 
5th interval. That way we separated chords to 5 chord types: 



Chord type 


Third interval quality 


Fifth interval quality 


Major 


major 


perfect 


Minor 


minor 


perfect 


Augmented 


major 


augmented 


Diminished 


minor 


diminished 


Suspended 


none (2nd or 4th instead) 


perfect 



Figure 4.2: chord types 
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Thus we have for each of 12 roots, 5 different chord types, thus 60 different chords in total. 
As we will see, that chord simplification is enough so as to train and make our system 
functional. 

This issue has been also studied and researched in literature [18]. 
4.2.2 Chord inference: Hidden Markov model 

As we already saw this subsystem is responsible in answering the question "What is the 
chord sequence that explains the melody played?" . In Background chapter we saw a specific 
model and algorithm that answers that question. Hidden Markov models is the model we 
chose for this task due to their simplicity, time complexity and because they are most 
appropriate model for answering that kind of question. So as to apply Hidden Markov 
models (HMM) on this task we have to set the question in a probabilistic context. 

We have two kinds of random variable sequences. The note sequence that is fully ob- 
servable from the system and the chords that are latent (not observable). We denote 
O = Oq, 0\ , . . . , Ot the sequence of the random variables that corresponds to observable 
melodic level (Oq is for bar 0, etc) and with C = Co, Ci, . . . , Ct to the hidden random 
variable sequence that correspond to the chord level. 

P(structure|surface) = P(chord sequence | melody) 

An interesting question that we had to answer was what will be an appropriate represen- 
tation for the melodic level's random variables (O). The answer was inspired by MySong 
system [25] which is a system developed by Microsoft for harmonisation (chord inference) 
of vocal tracks. The key observation is that each chord supports melodies that have specific 
distribution of it's notes in each pitch class. That means that a Cmajor chord will support 
melodies that have most of their notes in C pitch class some of them also in G and almost 
none in or G#. Thus, we can assign each chord with a distribution of pitch classes. 
Those distributions can be easily learned from appropriate data set. 

Emission probabilities 

As we saw , we map each chord with a discrete distribution of pitch classes. This represents 
how much of each note stayed in a pitch class during the playback of this chord. Although 
we will give details about training and data set later we will give a description of the way 
we learn those distributions. 

We denote with \x c the 12-demensional vector that correspond to the pitch class distribution 
of chord with index c. That means for the chord with index 1, [i\ represents the percentage 
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Cmajor 



F F# 

Pitches 



A# B 



Figure 4.3: Cmajor pitch discrete distribution 



of time that melody spent in each pitch class, 
e.g 

Hi = (0.43,0.2, ...,0.001) 

Also, fiij is the percentage of time melody spent in pitch class j for chord with index i. 

The following process is followed so as to learn jjl. For each bar we initialise a temporary 
12-dimensional vector t = (0, 0, 0, . . . , 0). Each element U of this vector represents the 
percentage of time that a note spent in the corresponding i-th pitch class. The following 
process compute that vector. For each note in the bar we extract the octave information 
so as to assign it to a specific pitch class (e.g. C3 to C pitch class). Then we add the note 
duration to it's corresponding vector position. For example for a half note of C3 we increase 
to by 4. Next, we normalise the profile so as avoid problems that might occur by the change 
of the time signature (although we assumed that this won't happen that way we are safe in 
case we remove that assumption) . Then we update fi c j = fj, c j + pcj where c is the index of 
the chord that accompanied that bar. When the training finish we normalise again fi C: j so 
as Ylj f^cj = 1- That way /ij corresponds to a probability distribution. 

So far we've seen a way to learn the distributions from data however we need a way to 
model emission probabilities - that is the probability a chord have accompanied the specific 
observation (group of notes). 

We denote with h a 12-dimensional vector which contains the pitch-class distribution of the 
melody of the observed bar. The computation of h occur with the same process we used 
to learn [x (figure 4.4). Thus now we have to find a way to compute P(melody\chord). It's 
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note sequence 



A,G,B,C,B,G,A,F 

pitch list 



note 


freq 


A 


0.25 


B 


0.25 


C 


0.125 


F 


0.125 


G 


0.25 



pitch histogram 



pitch frequencies 



Figure 4.4: Note to histogram process 



easy to observe that the probability of a note n played given a chord c is 

P(n\c)= fJ, i>pc ( n ) 

where pc(-) is a function that maps a note to it's pitch-class. For a bar bar a set of notes, 
the probability that it was accompanied by a chord with index i is 

T T 
P(bar\c) = Y[P{barj\c) = JJ Mc,pc(6ar f ) 

3=1 3=1 

where barj is the j-th note of bar and T it's total number of notes. So as to avoid small 
probabilities and underflows we will work with log probabilities instead. Thus 

T T 
logP(bar\c) = ^ log P(barj\c) = ^ log ti C)PC ( bar .) (4.1) 

3=1 3=1 

However we notice that notes that are in the same pitch class will produce the same proba- 
bility log Hi : p C (barj , ) • If we count the number of notes in each pitch class and we assign them 
in a vector that contains such information we can replace the equation 4.1 with 

12 

logP(6or|c) = J^tj--log//«y- (4.2) 

3=1 



Also note that hi = t(/T thus ti = hi ■ T. By replacing this to 4.2 we have 
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12 

log P(bar\c) = ^ tj • log fi, 



■c,j — 



^hj-T- log /j, Cjj oc ^2 hj ■ log /x, 



12 12 



(4.3) 



To summarise, for a bar bar we extract it's pitch-class distribution and we assign it to a 
12-dimensional vector hi. Then, to compute the probability the a bar is accompanied by a 
chord with index i we compute the dot-product of h with the logarithm of /Xj. 



Now we have to find a way to represent chords and learn the transitions between them. 
Transition probabilities 

As we described, Hidden Markov models need emission probabilities and transition proba- 
bilities to be specified so as to work. In our case we have already seen emission probabilities 
and now we will see how to compute P(chordt+\\chordt). 

Initially we have to find an appropriate representation of the chords. That was one of the 
biggest and most difficult decision we had to make in this thesis since the whole system 
could suffer by a bad representation. As we saw in the Background chapter the chord is 
a set of notes that played simultaneously. Recalling from music theory, chords in western 
music consist of the root note, the 3rd interval note and 5th interval note. For example a 
major chord consists of the root note, a minor 3rd (that is 4 semitones) and a perfect fifth 
(7 semitones). According the quality of the 3rd and 5th interval of the chord we distinguish 
5 types of chords (minor, major, diminished, suspended, augmented). 

So as to represent a chord we followed this simple representation. Each chord is named by it's 
root name followed by the type of chord e.g. Cmajor. As we will see in the experimental 
results, the number of different types of chords that we allow affected significantly the 
performance of our system. So as to proceed with our explanation we assume that we have 
only two types of chords - minors and majors. Thus 



where Root = {C, C#, D, D#, E, F, F#, G, G#, A, A#, B}. Each chord t is an integer that 
correspond to a chord at bar t. The mapping will be described later. 

As we know, the hidden level of a Hidden Markov model (HMM) is a Markov model of order 
1. That means that P(chordt+i\chordt) can be specified by a m x m transition matrix Aij 
, where m is the number of chords, which contains the probability that a chord with index 



log P(bar\c) oc h ■ log/Lt c 



(4.4) 



chordt £ Root x {minor, major} 
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i will transit to a chord with index j. Give chord sequences extracted from data set those 
probabilities can be learned by simple counting. 



P(chordt+i = i\chordt = j) = Ai 



# of transitions from i to j 
# all transitions = | training sequence | — 1 



(4.5) 



observations (note histograms) 
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prediction 





hidden states (chords) 



past future 



Figure 4.5: HMM for chord inference 



Initial probabilities 

The only part missing is the initial probabilities. That is the probability of chord i to be 
the first chord in the sequence P(chordo = i). This probability can be easily learned from 
training set 

, , # sequences starting with chord i 

TTi = P{chord = i) = — — — 

# of training sequences 



Chord inference 

To summarise our inferencer the topology of Hidden Markov model can be seen in figure 
4.5. At time t we assign the pitch-class distribution of the bar to observation ot and the 
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emission probability is 

log P(pt\chordt) = a + h • log fi c 
, where a is a normalising factor, and the transition probabilities 

# of transitions from i to j 



P(chordt+i = i\chordt = j) = Aij = 



# all transitions = | training sequence) 



Our task is now to infer the most probable chord sequence C = {chordo, chord±, . . . , chordt} 
for a history of observations O = {oq,o\, . . . ,o t }. That is 

P(C|0) 



We have already seen an algorithm that allows us to answer such questions very fast 
Viterbi algorithm. The objective function that we like to maximise is 

L = logP(C|0) 



However, recalling Bayes theorem we have that P(C|0) = P( ' Q j^Q^ C ' > oc P(OjC) • P(C). 
Thus, logP(CjO) oc logP(0|C) + logP(C) and by maximising the right part we maximise 
the left part too. The objective function becomes: 

L = logP(0|C)+logP(C) 

and the task: 

C* = argmax(logP(0|C) +logP(C)) 
c 

The recurrent equation of the Viterbi algorithm for this task are: 



Vb,fc= log P(O c | C = k) + log ^ (4.6) 

V t , k =\ogP{Ot\C t = k) + maxlogA iifc + V t - 1<k (4.7) 

i 

after replacing logP(0 |C = k) with emission probability formula (equation 4.4) we have: 



V 0j k =ho ■ log n k + log ir k (4.8) 
V t)k =ht ■ log/ifc + max log A i>k + V t -i tk (4.9) 



Finally as we described in Background chapter, we retrieve the Viterbi path using the 
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equations 



VT 



argmax(Vr,i) 



(4.10) 



Vt-i 



(4.11) 



Biasing emission over transition 

Another idea that we adopted from MySong was to introduce a bias factor in the objective 
function of Viterbi. That way we can bias chords that have better emission probabilities 
that transition. The reason why want to do so is because that way we can control the 
variability of inferenced chords and thus of the predicted ones. For example if we assign 
more importance on emission probabilities then the system will become more sensitive to 
the melody just played instead of the chord sequence that is most probable. 

More formally, the way we paraneterise the importance is by introducing a factor in the 
objective function. 



, where O = Oq,0\, . . . ,Ot the sequence of the random variables that corresponds to 
observable melodic level (Oo is for bar 0, etc) and C = Co, C%, . . . , Ct chord progression. 

So far, we saw a method to infer the chord sequence that accompanies a given a history of 
pitch-class distributions. Next we will see the subsystem responsible for chord prediction. 

4.2.3 Chord prediction: Variable Order Markov model 

One of the limitations of 1-st order Markov models is that it can't catch higher order depen- 
dencies. So as to illustrate this limitation imagine a chord sequence that is Cmjaor, Gmajor, 
Cmajor, Aminor, Cmajor, Gmajor, Cmajor or using a sorter notation C,G,C,Am,C,G,C. A 
1-st order Markov model will give positive probability for the next chord to be either Am 
or G. However, a closer look will reveal that there is a pattern (that is C, G, C, Am). How 
can we extend the 1-st order Markov model so as to a) catch those dependencies and b) be 
as fast as possible to train and query ? 

Variable Order Markov models (VoM) as we saw are models that are used in lose less 
compression and can be used also for prediction, due to their connection to the first problem 
[2]. Those models allows us to answer questions such us what is the next symbol given a 
context ? Or more formally: 



L= (l-a)-logP(0|C) + a-logP(C) 



P(s\cr), where s is next symbol and a the context. 
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In our task, the question is "Given a chord sequence, what is next chord ?". Or more 
formally: 

P(chordt+i\chord%)), where chords is the chord sequence until t and chordt+i next chord 

For this thesis, we implemented a Variable Order Markov model (VoM) using a tree structure 
which capture variable order dependencies and allows us to 

• Train the model on-line (during performance) 

• Train and query really fast (at real-time) 





Figure 4.6: Training of an empty Variable Order Markov model with chord sequence CGCAm- 
CGC 
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Example 

Recalling the example we saw in the beginning of this section let's assume that we have the 
following chord sequence 

C,G,C,Am,C,G,C 

and initially an empty tree. We will show how to train the tree and try to predict next 
symbol. As we can see in figure 4.6 we construct the tree incremental. That is we start 
parsing the sub sequences from right to left. For example, initially we start with symbol C 
preceding symbol G. Thus we create a node C with pointer to position 2 of the sequence. 
The numbers in red are the pointers to the training sequence. 

Given that we have trained our model we need to predict next chord. More formally, the 
question we want to answer is 

P(chordt+i\chor(% = CGCAmCGC) 

To answer this question we start from the rightmost symbol of our conditional sequence 
and travel the tree until the node we are is leaf (that is, it doesn't have children). In our 
example we start with symbol C and we visit node G and the C (reverse order). As we 
can see, the pointer of that node is 4 which means that next chord is Am with probability 
P(chord t+1 = Amlchordl = CGCAmCGC) = 1. 

What if we wanted to answer the question P{chordt+i\C) ? Again, from the tree we can 
see that there are three pointers 2,4,6. That means that there is \ probability of each 
pointer to be the next chord. However you can see that pointer 2 and 6 point to chord G 
and pointer 4 to Am. Thus, this gets translated to | for G to be next chord and | for 
Am. If we want to return the chord that maximise the probability we will have to count 
each pointer endpoints and return the most pointed chord. The reason why we use pointers 
instead of chords is because we might want to store more informations such as duration of 
the chord or stylistic options such as adding a 9th interval. Thus, this decision improves 
the expandability of our system. 

4.2.4 Hybridation 

As we saw our system consist of an inferencer and a predictor. We have already describe 
how each component work and we will proceed to the description of the main prediction 
algorithm. 

Every time a bar finish, the system computes the discrete pitch class distribution of the 
melody and call prediction algorithm with that as input. The prediction algorithm append 
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Figure 4.7: Example of query 



the observation x to an observation list which contains the observation history until then. 
Using that history, it calls Viterbi algorithm and infer a chord sequence that best explains 
that observation history. The chord sequence then is used to query the Variable order 
Markov model (VoM) so as to fetch next possible chord (prediction). We are ready now to 
play that chord and update the Variable order Model with the observation history. 

Algorithm 2 Next_chord_prediction 
input: bar note histogram x 
output: next chord 

append(observations, x) 

chord .sequence viterbi(observations) 

predicted -chord <— V oM _query (chord sequence) 

output(predicted_chord) 

V oM _train(chord sequence) 



Zero frequency 

One of the problems we had to deal with was the case where the Variable order Markov 
Model (VoM) doesn't have information about the next chord. This is possible mostly at the 



38 



Chapter 4 Design and implementation 



beginning of each session were there are not enough data to train the system, or the jam 
session is not very stable - that is, there is no structure or chord progression pattern. In 
that case, VoM would return the null value, which means that no prediction can be made. 
This is problem is known in literature as 'Zero frequency problem'. 

Our strategy to resolve this situation was the following. In the event of a zero-frequency 
we will use the information learned offline from the corpus. As we saw, this information is 
stored in the Hidden Markov Model in transition probabilities. This is equivalent to having 
trained the Variable Order Markov model with the corpus using as maximum allowed order 
equal one. To summarise, query our tree for a prediction, and if this fails we search for the 
chord that maximise the probability 

chordt+i = arg max P(chordt+i = i\chordt = j) 

i 

where j is the last played chord as inferenced by the Hidden Markov Model. 
4.2.5 Dataset and Training 

So far we have seen a description of the system and it's subsystems. However, one of the 
fundamental problems of most of machine learning problems is how the training occur. As 
we said, the idea of this thesis is to simulate the way a new musician learn to play and 
improvise music. That is, with practise with transcribed music. Thus, our initial aim was 
to find transcriptions of musical tracks. 

Our first choice was to start with MIDI files from well known jazz tracks. Real-Books are 
books that have been created so as to help jazz musicians to learn and improvise known 
jazz songs. The process of improvising a jazz song was known in the jazz slang as faking. 
Our initial data set consisted of 274 MIDI real-book's files with melody and chord track. As 
we saw, MIDI files contains the raw information of a note and the time that played without 
any other musical information such as key or chord name. That was a great problem to 
our approach since we had to find a way to annotate groups of concurrent played notes 
to chords. After researching we noticed that it was a problem of itself ([15]) that might 
introduce several errors in our method. For example, if we annotate wrongly several chords 
then the error would be propagated to the learner and our final system won't be reliable. 

For this reason, we abandoned the use of MIDI files and we looked for alternatives. MusicXML 
are files that contain transcribed music, annotated with musical informations such as chord 
name, key and time signature, etc. Thus, we found that such data will be sufficient to train 
our model. Our data set was fetched from the wiki style website wikifonia 1 

Unfortunately, wikifonia did not provide the musicxml files in an archive to download, so we 
^lttp:/ /www. wikifonia.org 
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Find chord sequence which maximise 

L = log P(chords) + log P(melody\chords) 

where, 

logP(chords) = log P start (chord ) + X)'=i -P( c i| c i-i) 
log P(melody\chords) = Xw=" ' log-P(^!,c;) 
logPOIc) oc ^fc log u Ck 



A ...and then predict next chord 
P(c(+i|cftords) , chords = c t , . 

where P is the Variable Order Markov 
model 
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Variable Order Markov model (VoM) 

Training complexity: 0(t 2 ) 
Query complexity: 0(t) 



Hidden Markov model (HMM) 

Viterbi complexity: 0(t\chords\ 2 ) 



Figure 4.8: The complete system 
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developed a web crawler in python. Using our crawler we fetched 3000 musicxml files. After 
inspecting them we noted that the files contained several pieces whose annotation violated 
our assumptions. For example, one of our assumptions was that chord change event can 
occur on each bar. However, there were pieces that more that two chord changes occurred 
in one bar. Thus, we had to develop python scripts so as to filter the data set. 

The filtering scripts we developed filtered the 3000 musicxml files as follows. As we already 
said we didn't want files that contained chord changes more that twice in a bar. What is 
more, another assumption we had was that no key modulation is allowed. However, we 
decided that musical pieces with key modulation will not be filtered out but preprocessed 
and used in the training. The preprocessing we applied was to transform each key mod- 
ulation in C key. Although that might change the musical feeling of the song we noticed 
that localy, musical statistical elements remain the same (such as the distribution of pitch 
classes relative to the key). 

Another issue we had to deal with was the amount of different chords and different types 
of chords. For example, in our data set we might have a Cmajor chord and a Cmaj9, 
which means the both chord are major chord but the one has also a 9th interval added. 
However, usually those additional notes in a chord are due to stylistic preferences of the 
player and not part of the compositions. Thus, we decided to simplify chords to the chord 
representation we already discussed. That way we keep the information needed for the 
probabilistic harmonisation and also we decrease the zero-frequency effect. Again, for the 
chord simplification we developed scripts in python which get as input the musicxml file 
and export a filtered and processed version of the file. 

Finally, for the training of the system we developed scripts in python for counting chord 
transition frequencies and initial chord probabilities. The output of those scripts were 
text files containing transition, initial and emission probabilities. Those files are given to 
the Hidden Markov Model as input. In figure 4.9 we can see the heat map of transition 
probabilities as trained from our data. 
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Figure 4.9: chord transition probabilities of HMM heatmatp 



4.2.6 Implementation 

Our system was of great importance to ran as fast as possible. What is more, given the 
limited time, we wanted a system with which we will be able to develop fast and ideally 
integrated with existing software of real-time music processing. Given those constraints we 
decided to develop our system on Max/MSP language. 

What attracted us to this language was the fact that we could route MIDI and audio signals 
through filters like flowcharts. That way it was easy to prototype ideas and change filtering 
and routing of signals during performance. What is more, it is extendable in a mean that 
we can develop plugins (called externals in Max/MSP terminology) using languages such 
as C, C++ or Java. 

For this project we developed a Max/MSP external. Our whole system was developed using 
Java and integrated into Max/MSP. After experimenting with several Hidden Markov Model 
implementations we decided that the best was to develop it from scratch. The reason we 
decided so was because of the customisability of such choice. What is more, the time it 
would take us to customise a ready library will be more than developing the library from 
scratch. For the same reasons we also developed from scratch the Variable Order Markov 
model. The final external consists of two inputs and one output. The first input takes 
the notes played by the musician and the second input waits for a signal that the bar has 
changed. Thus, each time that bar changes, notes gathered from the input are grouped to 
bars and the processed (e.g. pitch distributions ,etc). What is more, the predict algorithm 
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is executed and the predicted chord is the outputted. 

As we said, the whole system was developed as a Max/MSP external with two inputs and 
one output. Then, Max/MSP allowed us to route midi signals to the system and also route 
the output to midi devices (such as a midi synthesiser). Further more, Max/MSP enabled 
us to integrate our system with the commercial live performance software Ableton Live. 
That way, we could use our system with the state-of-art software in music live performance. 

The whole system can be downloaded and used from http://ptigas.com. 
4.2.7 Conclusion 

In this chapter we saw the complete description of our system, design decisions we made and 
why, and description of our implementation. In the following chapter we will see the different 
settings we used in our experiments and the results which show the main contribution of 
this thesis. 



Chapter 5 

Testing and evaluation 

Good art is not what it looks like, but what it does to us. 

Roy Adzak 



5.1 Introduction 

In this chapter we present the experimental settings we used and the evaluation we utilised 
to measure and compare the performance of our system. Also, we present the results of the 
experiments and the evaluation methods. 

5.2 Objective evaluation 

As we have already discussed we had to evaluate our system using objective criteria. Re- 
calling the aim of this project, we had to develop a system that predict chords at real-time. 
Thus we had to measure the response of the time and the accuracy of the prediction. 

5.2.1 Time performance 

We designed our system so as be as fast as possible. The time complexity of both Hidden 
Markov Model and Variable Order Markov model was small enough to response at real- 
time. However, complexity by itself is not enough since the actual performance depends on 
architecture and several other things (bottleneck, scheduling, etc). Thus, we had to measure 
the actual time it needs the system to respond - that is predict a new chord. For this, we 
developed simple Max/MSP externals which allowed us to measure the time between the bar 
change and the availability of the prediction. We measured the performance of the system 
during a rehearsal of duration of 10 minutes. That time was enough for an improvisation 
to complete and also a reasonable amount of time to measure. In figure 5.1 we can see the 
response of the system. Please note that response time decrease during playing since both 
chord sequence inference and prediction algorithm are sensitive to the size of history - that 
is the number of bars played. 
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8 - 




Figure 5.1: System's latency versus session's time 
5.2.2 Prediction accuracy 
Evaluation framework 

The most important objective of this project was to predict future chord using the history 
of the pitch distributions of the bars. What is more, our data set consisted of chord 
annotated transcriptions. Thus using the dataset as ground truth we could easily measure 
the prediction accuracy. 



For prediction accuracy measure of a test we used the percentage of chord predictions that 
were equal to the actual chord in the testing MusicXML. That is, 



T 

m(song) = L(chordt,predictiont) 

% 

where T is the number of bars of the test, chordt is the actual chord of test at bar t and 
prediction is the prediction at bar t using history of pitch probabilities for bars from to 
t- 1. 

What is more, we wanted to measure the performance on independent data so as to wee how 
good our system was in generalising. For this we used a method known as cross validation. 
This method splits data into n folds (foldi,...,fold n ). For i = l,...,n we repeat the 
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following. We use foldi as testing dataset and the rest folds for training. After training the 
system with data from training fold we measure the performance of each of the tests of the 
testing data set and then compute the average. Finally we average those averages for the 
final prediction accuracy of the system. In this thesis we used 10-fold cross validation. 

Experimental settings 

For the evaluation of out system we designed the following experiments. 

1) HMM inference accuracy 

In this experiment we aimed to evaluate the inference accuracy of the Hidden Markov Model 
and Viterbi algorithm. In other words, to measure how good is HMM for finding the most 
probable chord sequence given a history of bars' pitch distributions. It's easy to see that 
the prediction power of our system depends highly on how good it inferenced the chord 
sequence. 

2) HMM and VoM prediction accuracy (60 chords) 

In previous chapter we saw that we simplified chord representation so as to have 5 differ- 
ent chord types (minor, major, augmented, diminished, suspended) for each of 12 different 
root notes (C,C#,D,D#,E,F,F#,G,G#,A,A#,B). Thus in total we have 12 x 5 = 60 dif- 
ferent chords. In this exeriment we used all music pieces that are valid according to our 
assumptions and whose chords are simplified to those 60 chords. 

3) HMM and VoM prediction accuracy (7 chords) 

In the previous experiments we used as data set music pieces that contained chords from 
60-chord set. However, recalling music theory, the key and the scale information indicates 
which notes are most probable to occur. For example, if the key is C then most of the 
notes in the musical piece will be the notes that consist to C major scale. Those are 
C,D,E,F,G,A,B. If we use strict music theory then under the fact that we are in C major, 
only those notes are allowed. Under this constraint, we are allowed to use only 7 chords, 
called diatonic chords do to their connection to the key of the musical piece. 

We decided that this experiment will show how good is our system using the assumption 
that strict musical theory is applied. For this we filtered again our data set and kept only 
pieces that used as chords the 7 diatonic chords. 

After the filtering we ended with 235 MusicXML files that had only diatonic chords. This 
data set is then used for training and testing as discussed. 

4) HMM and VoM half-half 

We claim that our system's strength is that it can learn during the rehearsal and improve 
it's prediction ability. So as to measure this ability of our system we designed the following 
experiment. As we said we measure the prediction performance of our system on a song using 
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Cmajor 



Dminor 



Eminor 



Fmajor 



Gmajor 



Aminor 



Bdim 

Figure 5.2: 7 Diatonic chords of key C 



the formula m(song) = Y^J L(chordt, prediction t ) . In this experiment, instead of measuring 
the performance on all song, we measure it on the first and second half separately. That is 

T/2-1 

mi (song) = L(chordt, prediction^ 

i=0 

and 

T 

?722 (song) = L(chordt, prediction^ 

i=T/2 

The results of m\ and ni2 are saved into two separate files and processed to calculate p- values 
and overall prediction accuracy. 

5) HMM using transition matrix for prediction 

As we were designing the system, initially we used for prediction the transition probabilities 
as learned from the corpus. Thus future chords were predicted as follows : 

chordt+i = argmax P(c\chord t ) 
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This experiment actually measure the prediction power of a first-order Markov model trained 
off-line from corpus. The reason why we wanted to measure the performance of such model 
was because we wanted to test if the variability of the Markov model increased significantly 
the performance. 

6) Bayesian Band 

Finally, the last experiment was to test the Bayesian Band system. For this, we used the 
system as it was trained by them. The data set we used for testing were the 235 MusicXML 
files which contained only diatonic chords. 

Hypotheses testing 

We obtained p-values for the following two null hypotheses. 

HI: BayesianBand performs better that HMM+VoM (our system). 

and 

H2: Our implementation's prediction accuracy is not getting improved during the rehearsal. 
For HI and H2 we did an one-sided paired t-test and computed p-values as follows: 

HI: We kept the accuracy scores as calculated in two separate files. One for our implemen- 
tation (HMM with 7 chords) and one for Bayesian Band. We decided to use the 7-chord 
implementation since also their implementation use only 7-chords and thus the comparison 
is more fair. What is more, we didn't train their system on our data, but we tested it 
against our dataset. After the calculation of those two score files we used R programming 
language 1 and executed paired t-test (look appendix). The alternative hypothesis was that 
our system is better than BayesianBand. 

H2: Using the same workflow we split the prediction accuracy evaluation for the first and 
second halves of each song. That way we measured if the first half of each song contributed 
to the improvement of prediction for the next half of each song. The alternative hypothesis 
was that our system gets improved during rehearsal. 

Results 

For the experiments we developed bash scripts that execute and measure the performance 
of the system for each experimental settings we discussed. The results obtained can be seen 
in figure 5.3 



1 Programming language for statistical calculations and complex graphics plotting 
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Setting 

HMM inference 

HMM+VoM (60 chords) 

HMM+Vom (7 chords) 

HMM+Vom (7 chords) first half 

HMM+Vom (7 chords) second half 

HMM with prediction using transition probabilities 

Bayesian Band 



Performance 

68.028% 

34.46% 

49.24% 

47.8% 

50.67% 

48.76% 

28.03% 



Figure 5.3: Results of the experiments 



Hypothesis one-sided paired t-test p-value 

Bayesian Band performs better than HMM+VoM 8.105 * 10 -7 
HMM+VoM prediction accuracy doesn't get im- 0.0235 
proved during rehearsal 

Figure 5.4: p- values of hypotheses testing 



5.3 Subjective evaluation 

Evaluating creative systems only with objective criteria is like ignoring the artistic aspect 
of such systems. Interactive music systems, as creative systems, aim to output music and 
music is defined by it's audience. We don't aim to argue on what is music and what is not, 
however, the musical and artistic aspect of our system cannot be evaluated using other than 
humans. 

For this, we developed questionnaire and asked from the participants to rate the musical 
result of our system and of Bayesian Band. In detail, we used two compositions and one 
improvisation. Although, we understand that this is not improvised music, we defined 
improvise music as music that is unknown to the system ; that is, it hasn't been trained 
with. 

The three composition used are : 

• Mozart - Andante in C 

• Beatles - Hey jude 

• Improvisation in C key. 

We recorded each system rehearsing with each composition. For this experiment we allowed 
only 7 diatonic chords because of the nature of BayesianBand system and because we wanted 
to be as fair as possible. 
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We wanted to keep the questionnaire as simple as possible so as the participant not to get 
tired and to decrease any possible bias. For each question we ask the to answer using a 
simple Likert scale (0 to 4) which is a psychometric scale extensively used in questionnaires. 
The first question asked was the music skills of the participant. The answers for this question 
were for no training and 4 for professional musician. After this initial question at each 
page we have the audio recordings for each system for the same composition. Below each 
audio clip we ask the participant to rate the quality of the chords. The scale used was 
for very bad, 1 for bad, 2 for neutral, 3 for good and 4 for very good. We developed three 
such pages, one for each composition. The order of the recordings is randomised so as to 
decrease bias introduced by the recognition of some patterns or trends in chords prediction. 
The following figure presents the raw results of the questionnaire and in next chapter we 
will analyse the results. 

The questionnaire and audio samples can be found in http://ptigas.appspot.com. Also in 
the appendix you can find raw results of the questionnaire. 

p-value 0.005348 

BayesianBand 

HMM +VbM , , 



p-value 0.5 

p-value 0.2826 




12 3 
compositions 

Figure 5.5: Average of the questionnaire's ratings per system per composition 

Again we used hypothesis testing with one-sided paired t-test with alternative hypothesis 
that our system performs better than BayesianBand. As we can see, the results in first and 
second composition doesn't show any significance improvement. Our observation was that 
in both systems chords showed great variance. For example, in our system chords didn't 
change frequently and there were cases that it didn't sound very well. In BayesianBand 
however we noticed the following interesting behaviour. The probabilities of staying in the 
same chord were very small and thus at every bar it tended to change chord. That sounded 
more interesting (a participant comment that behaviour "playful") but not always correct. 

Also, keep in mind that each participant had different taste and thus sometimes they pre- 
ferred our system and some other times BayesianBand. In the third composition the answers 
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showed that participants preferred our system with p-value 0.005348 < 0.05 which was a 
significant improvement. Our interpretation of this result was that the melody was poly- 
phonic and thus at each bar the pitch distribution was more informative and thus emission 
probabilities were calculated better. The result was that our system sounded better and in 
our subjective opinion chords were more accurate. 




Figure 5.6: ratings per skill for composition 1 



Bay esi an Band 




■ 



12 3 4 

participants music skills (0 no training - 1 professional musician) 



Figure 5.7: ratings per skill for composition 2 




Figure 5.8: ratings per skill for composition 3 



Chapter 6 

Discussion, conclusions and future 
work 

If you develop an ear for sounds that are musical it is like 
developing an ego. You begin to refuse sounds that are not musical 
and that way cut yourself off from a good deal of experience. 

John Cage 

In this chapter we will evaluate to what extend we achieved our goals. Using the experimen- 
tal results from previous chapter we will critically discuss about the objectives fulfilment 
and draw a conclusion of this thesis. Finally, the directions we aim to extend our work in 
the future will be referred. 

6.1 Aims achievement 

In the Introduction of this thesis we mentioned the aims and objectives of this project. 
The main goal was to develop a system that is capable of providing accompaniments to an 
improvising musician during the rehearsal in real-time. Our results indicates that our system 
predict chords performing better than existing software (BayesianBand) and moreover works 
in real-time. 

As we saw in previous chapter (figure 5.3) the prediction accuracy of our system was 34.46% 
for 60 chords and 49.24% for 7 chords. That indicates that there was an improvement of 
the performance as we reduced the set of chords to the diatonic chords. One of the reasons 
why this happened was that less chords implied less choices in chord prediction. Indeed, it's 
easy to see that the probability of a single chord prediction to be accurate is 1/7 whereas 
with 60 chords is 1/60. What is more, diatonic chords have the property that they are the 
most probable chords to occur during a musical piece. Thus, thus, transition probabilities 
to diatonic chords tend to be greater. When we test our system using 60 chords chances are 
the we won't be able to predict accurately non-diatonic chords, due to the small probability 
of transition to such chords. What is more, we noticed that there were diatonic chords 
that had similar chord-profile with non-diatonic chords. Thus, although we might have a 
big emission probability for a non-diatonic chord, again, the transition probability bias the 
preference of our system to a diatonic chord. 
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The performance of 7-chord system shows that it outperforms BayesianBand system in 
the objective evaluation which had a prediction accuracy of 28% on the same dataset. 
Both systems use 7 diatonic chords thus in our opinion the comparison was fair. As we 
mentioned in previous chapter, we tested the null hypothesis that our system performs worse 
than BayesianBand in objective evaluation. More formally, the null-hypothesis was that the 
mean difference of the performances of our system versus BayesianBand's performance is 
less than zero. One-sided paired t-test showed that with p- value = 8.105 * < 0.05 we 
rejected null hypothesis and thus we had a significant improvement 1 . 

One of the greatest achievements of this project was that the system is capable of learn- 
ing during the performance (online training) and improving it's prediction accuracy. The 
prediction accuracy of the first halves of the songs is measured 47% and for the second 
halves 50.67%. Thus, our results shows that the first half of each song contains informa- 
tion that our system capture and thus the prediction for the second half of each song gets 
improved. What is more, the improvement of the performance was significant with p-value 
0.0235 < 0.05. 

Also, our subjective evaluations showed that our system doesn't perform worse than Bayesian- 
Band. What is more, in one of the compositions we outperformed BayesianBand signifi- 
cantly with p-value 0.005348 < 0.05. Those results seem promising although further im- 
provements can be done. 

Finally, one important characteristic of our system is that it has to work in real-time. As we 
saw in previous chapter, our system indeed is capable of providing prediction with latency 
< 0.06s on a 2.0 GHz core 2 duo computer with 2GB RAM. That indicates that our system 
can indeed work in a real-time context for a simple 10 minute improvisation, although 
further improvements can be applied. The latency increase during time, which is expected 
since both complexities of Viterbi algorithm and training of Variable Order Markov model 
are functions to number of bars played. 



6.2 Future work 



One of our low priority goals was to develop a system that will actually deal with audio 
and not only symbolic music (MIDI). In literature there have been many tries to solve this 
problem however none of which solved this problem efficiently. What is more, transcribing 
or pitch tracking an audio source is a problem on it self. Thus, it is expected that the im- 
plementation of such system would decrease the performance since the error is propagated. 
Due to limited time, although we developed a system that deals also with audio source, we 
didn't evaluated and thus didn't described in detail. 



1 The alternative hypothesis was that our system is more accurate than BayesianBand 
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Chapter 6 Discussion, conclusions and future work 



In [26], Adam Start et al presented the idea of retrieving a chromogram from an audio 
source in real-time. A great advantage of our design was that we could use those chromo- 
grams instead of pitch tracking to retrieve pitch list and then computing again pitch class 
distributions. What is more, this allowed us also to deal with polyphonic audio source and 
not only monophonic. Then we route those chromograms to our system the same way we 
did we pitch distributions, and it's easy to see that they are the same. The author of this 
thesis performed with such system and found that if the source was noise-free enough then 
it predicted some pleasing chords. However, we understand that a proper evaluation of such 
prediction accuracy is needed. Thus, one of our future goals is to improve the system and 
evaluate it again with both subjective and objective evaluation methods. 

Another issue that we aim to work on in the future is to experiment with a more complicated 
model, called Variable Order Hidden Markov model. Since it is relative new model in the 
field of machine learning, it hasn't been applied extensively, although when it got applied 
it's results where promising. To our surprise, a recent research from Georgia tech used 
this model for predicting Table keystrokes in Indian music [7]. Our aim is to extend and 
compare to the current implementation. 

Finally, we noticed that our current implementation was very memory consuming. That 
was caused by the nature of Variable Order Markov model and primary by the fact that we 
didn't make any optimisation to reduce memory usage. However, it's easy to see that our 
implementation is very similar to a suffix tree. What is more, there exist data structure that 
implements a Variable Order Markov model that is called probabilistic suffix tree and one 
of our aims is to use a suffix array data structure which is known for it's efficient memory 
complexity [13]. 



6.3 Conclusions 



Our main and most important goal was to implement a system that predict accompaniments 
in a jam-session in real time. For this we developed a system that achieves that goal 
to the extend that a summer project allowed us. Our results indicates that indeed our 
implementation can run in real time and predict chords with very good prediction accuracy. 
What is more, in the objective evaluation it outperforms current system called Bayesian 
Band. Also, our implementation shows that it is capable of learning underlying structure 
of the improvisation and utilise this knowledge for improving chord prediction. Finally, our 
design allows us to extend further the system so as to use polyphonic audio noise- free source 
(e.g. dry signal of a guitar). 

In the previous chapters we also reviewed some of the state-of-art interactive music systems 
(Chapter 2). What is more, in chapter 3 we reviewed the graphical models and analysed 
the Hidden Markov Model and Variable Order Markov model that we used in this thesis. 



6.3 Conclusions 
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In chapter 4 we described our implementation along with the decisions we made during 
this project. Additionally, in chapter 5 we analysed the evaluation methods we used in 
this project and we presented our results. Finally, in chapter 6 we critically evaluated our 
results, we gave some directions for further work and we concluded to which extend our 
aims and goals got achieved. 

Last but not least, this thesis fulfilled the ultimate goal of the author which was to combine 
computer science, mathematics and music. As a byproduct of this thesis was the better 
understanding of how humans percept music and act during improvisation. As a last com- 
ment on this work, we would like to use Johny Cage words 

"If you develop an ear for sounds that are musical it is like developing an ego. You begin 
to refuse sounds that are not musical and that way cut yourself off from a good deal of 
experience. " 

Interpreting this in a machine learning context, always such system will reflect the aesthet- 
ics of the author and data used for training. However, during this thesis we gained the 
confidence to say that it's not far the future where humans and machines will be able to 
collaborate for creating art, although still there are elements of human perception that must 
be incorporated. 
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Questionnaire raw results 
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Figure A.l: Raw questionnaire results. Cj stands for composition i, BB for BayesianBand and 
HMM+VoM for our system. 
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